ABSTRACT

Increasing variability during manufacturing and during runtime is
projected for future-generation microprocessors. This paper introduces
a pre-RTL, architectural modeling methodology that incorporates the
impact of manufacturing and runtime temperature variations on delay
and power for both combinational logic and SRAM structures. The model
is then used to show that frequency variations among microarchitectural
functional units and among cores are relatively small in a
high-performance microprocessor design. However, the impact of
within-die (WID) systematic process variations on leakage power will
result in major leakage variation across multiple cores on a single
chip. WID leakage variation can cause core-to-core leakage to differ
by as much as 45%.

1.  INTRODUCTION 

The 2005 International Technology Roadmap for Semiconductors projects
that parameter variations will present critical challenges for
manufacturability and yield. While process, circuit-design, and
statistical CAD techniques can mitigate the impact of some parameter
variations, the roadmap and some industry observers [3] have claimed
that computer architecture will play an important role in mitigating
the effects of parameter variations.

At the same time, multi-core designs have become the dominant
organization for future microprocessor chips, as high-frequency single
cores run into power, thermal, and complexity limitations that will
only be exacerbated by future technology trends. The inclusion of
multiple cores, of the same or different types, allows continued
exponential performance scaling for applications that can take
advantage of the parallelism that CMPs offer. Multi-core organizations,
however, also multiply the ways in which parameter variations can
affect a processor. Although some have speculated that this will yield
significant variations among units in a single core, this paper argues
that instead the most important phenomenon will be core-to-core (C2C)
leakage variations at the 45nm technology node and beyond.

Parameter variations encompass a range of variation types, including
process variations due to manufacturing phenomena, voltage variations
due to manufacturing and runtime phenomena, and temperature variations
due to varying activity levels and power dissipation. These three main
sources are often referred to as PVT (process-voltage-temperature)
variations. Process variations are static and manifest themselves as
die-to-die (D2D), within-die (WID), and wafer-to-wafer (W2W)
variations, while temperature and voltage variations are dynamic
phenomena. Temperature variations stem from different activity factors
among cores and functional units, from different circuit structures,
and from non-uniformities in the thermal interface material (TIM) that
bonds the chip to its package. Voltage variations stem from IR drops
that result from non-ideal voltage distribution. These drops are
activity dependent and are exacerbated by temperature-dependent
leakage-current variations (i.e., varying the I term) and by switching
activity that causes voltage droops due to circuit inductance and
possibly insufficient decoupling capacitance. These three variation
sources exhibit a number of feedback loops. Process variations affect
leakage, which affects both voltage and temperature. Temperature then
affects leakage, forming a feedback loop between the two
parameters [11].

This paper focuses on WID variations. D2D variations cause each die on
a wafer to have different mean values for a particular parameter. Gate
length is the most common parameter to exhibit D2D variation, and is
typically modeled by assigning a normally distributed offset to each
die. D2D variations can be dealt with by sorting chips into different
product bins or by chip-wide techniques that compensate for a
parameter’s offset, such as adaptive body biasing [17]. W2W variations
primarily affect the shape of the WID systematic pattern as well as
across-wafer systematic patterns that result in chip-to-chip variations
similar (but larger in magnitude) to D2D variations. In short, D2D
variations determine the variance of the frequency distribution while
WID variations determine the mean of the distribution [4].

WID process variations can further be divided into random and
systematic variations. Random variations will affect each transistor
differently, while systematic variations cause transistors to be
spatially correlated. Systematic variations may be caused by a variety
of different sources. The most notable are variation in optical
intensity across the exposure field and non-uniform chemical-mechanical
polishing that occurs due to different pattern densities.

This paper argues that the WID variation phenomenon of chief interest
in the computer architecture domain will occur at a C2C granularity,
rather than at a unit-to-unit granularity. While unit-to-unit
variations in delay will occur, the WID frequency distribution will
likely be dominated by large SRAM structures. This occurs because of
the way in which existing critical path models determine worst-case
delay. The final result is that large SRAM units will have a mean delay
that is much greater than that of the remaining units. We find the
overall impact of random variations on clock frequency to be fairly
mild when reasonable assumptions are made about each parameter’s
variance. At a per-unit granularity, random variations in leakage are
even milder than frequency variations, since a stage’s or unit’s
worst-case delay is the unit’s maximum critical path delay, while
leakage is merely an aggregate sum across all the transistors in the
unit. WID systematic variation, WIDsys, plays an important role
because, at the 45nm generation and beyond, reduced core areas will
cause parameters within a core to be highly spatially correlated, while
the amount of variation that can occur across a chip can be large.

Systematic variation will result in both C2C frequency and leakage
variation. C2C frequency variations will be modest in comparison to
leakage variation. This is because the amount of WIDsys that occurs
across a chip (10-15% variation in gate length) has only a linear
impact on frequency. Leakage, by contrast, has an exponential
dependence on gate-length variation (because of its impact on threshold
voltage) and therefore shows the most important architectural WID
variation.

Specifically,  the  main  contributions  of  this  paper  are: 

•  A top-down model for studying parameter variations at the
architectural level. This model accounts for random and systematic
WID device variations. D2D and W2W variations are easily added but
not discussed further here. The chief requirement is that the model
not require detailed circuit implementations, because early-stage,
pre-RTL studies, especially for a multi-core chip, require an ability
to explore the design space before detailed circuit implementations
are available.

•  An improved critical path model is used to analyze the likely
impact that each functional unit will have in determining the
processor’s maximum clock frequency distribution. We determine that
SRAM units are likely to determine the processor’s clock speed, not
logic-dominated stages. In particular, the L1 caches will be the
primary limiter, unless variation-aware techniques are applied.

•  Using a 14mm by 14mm die as our baseline chip model, we show that
in a multi-core architecture, C2C leakage variation can be as much as
45% when the thermal-feedback loop between leakage and temperature
has been closed.

The rest of the paper is structured as follows. Section 2 gives an
overview of parameter variation phenomena and discusses related work.
Section 3 introduces the architectural PVT model. Section 4 looks at
frequency variation at both the functional unit level and the core
level. Section 5 looks at across-chip leakage variation, and Section 6
concludes.

2.  BACKGROUND  AND  RELATED  WORK 
2.1  Background 

Parameter variations cause chip characteristics to deviate from the
uniform, ideal values desired at design time. Three major sources of
variation are often discussed: process variations, which consist of
deviations in the manufactured properties of the chip, such as feature
size, dopant density, etc.; voltage variations due to non-uniform
power-supply distribution, switching activity, and IR drop; and
temperature variations due to non-uniformities in heat flux of
different functional units under different workloads as well as the
impact of non-uniformities in the chip’s interface to its package.
These comprise the classic “PVT” variations.

Process variations occur because specific steps in the fabrication
process, such as lithography, ion implantation, and chemical-mechanical
polishing, are vulnerable to imperfections, noise, and imperfect
control across time and locations. Process variations present a problem
because they can make a given circuit exhibit different delay or power
characteristics than intended during design. Since the operating
frequency in high-performance chips is typically determined by the
expected delay of the slowest path, variation in the delay of the
slowest path can make a single, fixed clock frequency too fast (causing
errors) or too slow (incurring an opportunity cost). Post-manufacture
testing is therefore used to characterize chips and “bin” them
according to their maximum clock frequency, giving perhaps a 30%
variation among chips. Unfortunately, the fastest chips are usually the
leakiest, because both frequency and leakage are affected by one of the
main victims of process variations, the threshold voltage. Threshold
voltage is affected by both fluctuations in the channel doping (which
get worse as smaller channel lengths mean fewer dopant atoms are in the
channel region) and the effective gate length (which affects threshold
voltage through drain-induced barrier lowering (DIBL)). In fact,
subthreshold leakage is exponentially dependent on threshold voltage,
and this produces large D2D leakage variation. The fastest chips often
cannot operate at their peak sustainable frequency because they would
overheat, and a suitable cooling solution is too expensive. Per-chip
adaptive body biasing (ABB) [17] can reduce these spreads and boost the
yield of high-quality parts at the cost of some additional testing and
calibration.

Until recently, W2W and D2D variations were the main source of concern,
and these could be addressed through bin splits and ABB. However, as
transistors scale in size, small WID variations in feature size and
doping density, once imperceptible relative to the large feature sizes
in older technologies, have become important as their impact becomes
larger in relative terms. As mentioned in Section 1, two forms of WID
variations are present. Random variations are small changes from
transistor to transistor which do not show any correlation across
larger distances on the chip. Systematic variations, on the other hand,
exhibit high degrees of spatial correlation. Random variations stem
primarily from two main sources. Non-uniform dopant implantation in the
channel depletion region affects threshold voltage, and imperfect
control of the lithographic process results in non-deterministic gate
lengths. Systematic variations in gate length stem primarily from the
lithographic exposure process. Non-uniform exposure intensity, lens
aberrations, defocus errors, and mask errors, as well as many other
factors, may all contribute to the final systematic variation pattern.

While systematic variations are modeled as affecting all circuits in a
critical path in the same fashion, random WID variations can affect the
same circuit in a myriad of different ways. Analyzing all possible
permutations is usually prohibitive, requiring statistical treatments
which have become a major research topic in the CAD community. These
variations are exacerbated by runtime effects like temperature and
noise. To account for the possible slowdown due to PVT variations, the
voltage margin must be increased to compensate. The concern is that as
technology scales, PVT variations will increase in severity, requiring
larger design margins.

2.2  Modeling 

While there has been a great deal of work on statistical approaches to
modeling and compensating for variation in the CAD community, there has
been little work on modeling variations in the architecture community.
Yet parameter variations are important enough that architectural
mitigation techniques need to be explored before or at least in
parallel with circuit design. This requires a pre-RTL modeling
capability that does not depend on detailed circuit designs to estimate
the impact of parameter variations on different microarchitectural
units.

Perhaps the most relevant prior work is the “FMAX” model introduced by
Bowman et al. [4]. FMAX is a predictive model for capturing the maximum
frequency distribution. It is built on a generic critical path (GCP)
model that is based on canonical NAND gates. The NAND gate’s delay is
derived from the RC delay equation. The delay distribution can then be
determined with Monte Carlo analysis by varying the delay equation’s
inputs. Results from the GCP model were compared to measured data from
high-volume industrial 0.25µm and 0.13µm processes. While the GCP model
did not perfectly recreate the measured frequency distribution, it did
provide insight into what the frequency distribution would look like.
In the FMAX model there are two parameters of concern to
microarchitects: the number of independent critical paths, Ncp, and the
depth/length of the critical path, Lcp.1

Figure 1: Plot showing the delay distribution’s dependency on the
number of critical paths, Ncp.

The statistical relevance of Ncp is that the worst-case delay of a unit
is the maximum delay across all critical paths. As Ncp increases, the
stage’s mean delay will also increase. The reason this occurs is that
when a larger sample size is considered, the probability increases that
the worst case is an extreme outlier. Similarly, as Ncp increases, the
distribution’s variance will decrease since the maximum delay is likely
to be determined by an outlier. Fig. 1 illustrates the dependency
between the delay distribution and Ncp. Logic depth determines Lcp.
Path delay is determined by taking an aggregate sum of each gate’s
delay in the path. Since a sum operation is performed, the path’s ratio
of variance to mean will decrease as Lcp increases.
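The two statistical effects described here can be illustrated with a minimal Monte Carlo sketch. It is not the FMAX model itself: it assumes independent, normally distributed gate delays, and the function names, the gate-delay mean and sigma, and the trial count are illustrative assumptions.

```python
import random
import statistics


def path_delay(l_cp, rng, gate_mean=1.0, gate_sigma=0.1):
    """Delay of one path: the sum of l_cp independent gate delays."""
    return sum(rng.gauss(gate_mean, gate_sigma) for _ in range(l_cp))


def unit_delay_stats(n_cp, l_cp, trials=1000, seed=0):
    """Mean and standard deviation of a unit's worst-case path delay.

    Each trial draws n_cp independent paths and keeps the slowest one,
    mirroring the max-over-critical-paths view described in the text.
    """
    rng = random.Random(seed)
    worst = [
        max(path_delay(l_cp, rng) for _ in range(n_cp))
        for _ in range(trials)
    ]
    return statistics.mean(worst), statistics.stdev(worst)


if __name__ == "__main__":
    # More critical paths: the mean worst-case delay rises, the spread shrinks.
    for n_cp in (1, 10, 100):
        mean, sigma = unit_delay_stats(n_cp, l_cp=10)
        print(f"Ncp={n_cp:3d} Lcp=10  mean={mean:.3f}  sigma={sigma:.3f}")
    # Deeper paths: the spread relative to the mean shrinks.
    for l_cp in (5, 20):
        mean, sigma = unit_delay_stats(1, l_cp)
        print(f"Ncp=  1 Lcp={l_cp:2d}  sigma/mean={sigma / mean:.4f}")
```

Under these assumptions, the first loop shows the mean shifting upward and the spread narrowing as Ncp grows, and the second shows the relative spread falling as Lcp grows.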

This paper primarily focuses on leakage variation that occurs as a
result of systematic effective gate length variations. Zhang et al. [18]
showed that it is necessary to consider systematic Leff variations when
estimating full-chip subthreshold leakage. Ashouei et al. [2] developed
a model that addresses WIDsys leakage variation at the circuit level.
Systematic variations are modeled as circular areas with highly
correlated Leff values. The correlated areas may vary in their area,
location, and magnitude. Our modeling methodology differs in that we
base our WID systematic variation pattern on measured data reported in
[6, 14].

It is necessary to emphasize that the pattern of the WID systematic
variation (WIDsys) is highly dependent upon the fabrication process.
It can manifest itself as being either deterministic or random in
nature depending on the particular fabrication process. Deterministic
systematic variations can be mitigated with a combination of optical
proximity correction, phase-shift masking, and other mask-level
techniques. Since mask costs are already burdensome and increasing with
every technology node, design-for-manufacture (DFM) techniques that
simplify mask complexity with variation-tolerant designs are desired.

The main advantage of modeling a measured deterministic systematic
pattern is to better understand at what granularity the systematic
change will occur, and how this will affect decisions in the
architectural domain.

2.3  Architectural  Implications 

In [13], Marculescu and Talpes propose to apply the FMAX model in the
microarchitecture domain by assuming that Ncp is proportional to the
stage’s device count. While this assumption often holds true, it is not
always the case, since not all paths are critical [1, 9]. Also, the
authors do not consider that a large portion of a stage’s delay will be
spent in wires when estimating Lcp. While these assumptions provide a
simple way to reason about how variations affect the FMAX distribution,
they may be misleading when analyzing the impact that each particular
unit’s delay distribution will have on the final FMAX distribution. The
authors proceed to show how a GALS architecture can mitigate the impact
of process and temperature variations.

1 In [4], the authors use the notation ncp to represent logic depth.
In order to avoid confusion between ncp and Ncp, we refer to logic
depth as Lcp.

Ernst et al. [9] also use a GALS architecture, but use shadow latches
on critical paths to dynamically detect and correct circuit timing
errors. With this added functionality, the authors show that
significant power savings can be obtained by performing per-stage DVS
in order to reclaim design margin. The focus of this work, however, is
on reclaiming excess design margins in single-core chips, including
D2D variations and not just unit-to-unit and runtime
voltage-temperature variations.

Neither paper considers the impact of C2C variations. Our paper
presents a more detailed modeling methodology and shows the importance
of C2C phenomena. The main contributions of the model are: (i)
stage-specific Ncp and Lcp characteristics, most importantly the
differences between SRAM and combinational logic; (ii) systematic Leff
variations; (iii) the importance of leakage, as opposed to frequency,
as a consequence of WID variations and the resulting design driver.

3.  VARIATION  MODEL 
3.1  Critical  Path  Model 

To model the impact of parameter variations upon delay, we observe that
the clock frequency is dictated by the worst-case delay for any
pipeline stage. Similarly, the delay of each pipeline stage is
determined by the worst-case delay across all the stage’s critical
paths. Frequency is therefore determined by MAX(Tcp), i.e., the
worst-case delay of all critical paths. The delay of each critical
path, in turn, can be decomposed into D2D, WID-random, and WIDsys
variations:

$$T_{cp} = T_{cp,nom} + \Delta T_{D2D} + \Delta T_{WID\text{-}rand} + \Delta T_{WID\text{-}sys} + \Delta_{Temp} + \Delta_{V} \tag{1}$$

where $T_{cp,nom}$ is the nominal critical path delay without
variations; $\Delta T_{D2D}$ is the contribution of D2D variation,
which is a fixed offset per die; $\Delta T_{WID\text{-}sys}$ is the
contribution of WID systematic variations; and
$\Delta T_{WID\text{-}rand}$ is the accumulated contribution of random
variations across that critical path. $\Delta_{Temp}$ and $\Delta_{V}$
in turn represent the additional impact of temperature and voltage
variations.
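The decomposition in Equation (1) can be sketched as a small data structure. This is only an illustration: the class name, field names, and delay values (in nanoseconds) are hypothetical and do not come from the paper, and the frequency is taken as the reciprocal of the slowest path delay.

```python
from dataclasses import dataclass


@dataclass
class PathDelay:
    """One critical path, with the additive terms of Equation (1)."""
    nominal: float          # T_cp,nom
    d2d: float = 0.0        # die-to-die offset, fixed per die
    wid_rand: float = 0.0   # accumulated random WID contribution
    wid_sys: float = 0.0    # systematic WID contribution
    temp: float = 0.0       # temperature contribution
    volt: float = 0.0       # voltage contribution

    def total(self) -> float:
        return (self.nominal + self.d2d + self.wid_rand
                + self.wid_sys + self.temp + self.volt)


def max_frequency(paths):
    """The slowest critical path sets the clock: f = 1 / max(T_cp)."""
    return 1.0 / max(p.total() for p in paths)


if __name__ == "__main__":
    # Hypothetical delays in ns, so the result is in GHz.
    paths = [
        PathDelay(nominal=0.25, d2d=0.01, wid_rand=0.005),
        PathDelay(nominal=0.25, d2d=0.01, wid_rand=0.02, wid_sys=0.01),
    ]
    print(f"f_max = {max_frequency(paths):.3f} GHz")
```

Because the D2D term is a per-die constant, it shifts every path equally, while the WID terms decide which path ends up slowest.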

To understand the role that each pipeline stage plays in determining
the processor’s final frequency distribution, we model delay variations
at a per-functional-unit granularity. For a critical path model to be
useful for architectural studies, the model should be able to recognize
the inherent differences between different functional units’ circuit
structures. With this information, the model should then be able to
reason about the processor’s frequency distribution.

A main assumption in our model is that all stages can be loosely
categorized as being dominated by either SRAM or combinational logic.
SRAM-dominated stages include not only cache/TLB stages, but also those
that involve large buffers, queues, or lookup tables. Combinational
logic will have a much larger Lcp than an SRAM stage since a large
portion of an SRAM device’s total delay is spent in wires (bitlines,
wordlines, etc.) rather than transistors. Therefore, in an SRAM device,
transistors will only contribute a small fraction of the stage’s total
delay. Even in logic-dominated stages such as an ALU, it is expected
that a significant portion of the overall delay will be spent in the
interconnects, but wire-friendly circuit implementations can be used to
minimize the amount of interconnect delay [12]. Prior critical path
delay models did not consider the ratio of wire delay to transistor
delay, causing their conclusions to be overly pessimistic.

It should be noted that, while wire delay is not exempt from
manufacturing variations, it is the general consensus that the impact
of wire variations will pale in comparison to that of transistor-level
variation. The reason for this is that wire geometries are not as
aggressively scaled as gate length. For simplicity, we have chosen not
to model wire variation in this study. Future work includes analyzing
the interaction between WIDsys Leff and WIDsys wire variations.

The other main difference between logic and SRAM stages is the value of
Ncp. The critical path in any SRAM is in accessing the actual cell
through a wordline and sensing the voltage difference on the bitline
with the help of the sense amplifiers. Since both wordlines and
bitlines have to be brought back to their initial state before a new
access can begin, this critical path forms a loop with itself. As a
consequence, this critical path determines not just the access time,
but also the minimum cycle time in a pipelined cache. Pipelining this
critical path will be increasingly difficult as variations worsen. We
model the number of critical paths in an SRAM as the number of bits in
the cache multiplied by the number of read ports per bit.2 Prior models
did not consider that each read port is equally critical, but rather
treated each SRAM cell as being one critical path.
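As a worked example of this counting rule, the sketch below computes Ncp for an SRAM array. The function name and the 32 KiB, two-read-port configuration are hypothetical, and tag and status bits are ignored for simplicity.

```python
def sram_critical_paths(capacity_bytes: int, read_ports: int) -> int:
    """Ncp for an SRAM: (number of bits) x (read ports per bit)."""
    return capacity_bytes * 8 * read_ports


if __name__ == "__main__":
    # Hypothetical 32 KiB data array with two read ports.
    print(sram_critical_paths(32 * 1024, read_ports=2))
```

Counting one path per cell instead, as the prior models are described as doing, would halve this figure for a two-ported array.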

Identifying the value of Ncp in a logic-dominated stage is more
complicated than in an SRAM. In a standard circuit design, only a
subset of the total paths are actually critical. However, circuit
designers increase the delay of non-critical paths in order to reduce
dynamic and static power dissipation, potentially causing non-critical
paths to become critical.

The inherent differences between SRAM and logic circuitry necessitate
different critical path models. In order to estimate the impact of
process variations on SRAM structures, we have modified a beta version
of CACTI 4.0 to incorporate the effects of process variations on delay.
More detail will be provided about this model in the following section.

For simplicity, the logic critical path model is based on conventional
static adder circuitry. Although representing all logic stages with an
adder is an idealistic assumption, we feel that this simplification
still provides important insights that allow architects to draw useful
conclusions.

The combinational logic critical path model is based on a Sklansky
adder. The Sklansky adder is not as heavily impacted by wire delay in
comparison to similar prefix adders, such as a Kogge-Stone [12]. We
assume the critical path in an adder is determined by the time required
to pass the carry bit from the least significant bit to the most
significant bit. The entire delay for the adder is the carry-bit
propagate delay as well as the delay of the initial carry generate and
the sum logic. Fig. 2 illustrates the critical path in a Sklansky
adder. For simplicity, only the carry bit’s path is shown. One drawback
of the Sklansky adder is that the number of fanouts doubles at each
level. The high fanouts make it important to properly size gates on the
critical path, or else high performance would not be obtainable.
Transistor widths were chosen such that a 64-bit adder’s nominal delay
fell in accordance with data extrapolated from [12], and curve fitted
to a 45nm technology node. We assume that 35% of the total delay in a
64-bit Sklansky adder can be attributed to interconnect delay.

2 We have neglected write ports on the assumption that they are not on
the critical path.

Figure 2: Critical path in a Sklansky adder is highlighted. The
critical path is assumed to be the delay required to propagate the
carry bit from the least significant bit to the most significant bit.
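The fanout growth of the Sklansky adder can be tabulated with a short sketch. It assumes an idealized radix-2 Sklansky prefix tree with a power-of-two width, log2(width) prefix levels, and a maximum fanout that starts at 1 and doubles at each level. The function name is hypothetical, and the gate sizing discussed in the text is not modeled.

```python
def sklansky_fanouts(width: int) -> list[int]:
    """Maximum fanout at each prefix level of an idealized Sklansky adder."""
    if width < 2 or width & (width - 1):
        raise ValueError("width must be a power of two, at least 2")
    levels = width.bit_length() - 1  # log2(width) prefix levels
    return [2 ** level for level in range(levels)]


if __name__ == "__main__":
    fanouts = sklansky_fanouts(64)
    print(f"levels={len(fanouts)} fanouts={fanouts}")
```

Under these assumptions, the last level of a 64-bit adder drives half the bit positions, which is why sizing the gates on the carry path matters.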

In order to estimate delay, we used the same delay model with which
CACTI models decoder logic. More information about the delay model can
be found in [10]. One advantage of using this delay model is that it
takes into account that transistor delay is dependent on the load of
the input signal. By properly modeling this, the correlation in delay
between adjacent gates is accounted for, which some prior models have
neglected. All gates in the critical path, except for buffers and
inverters, require two input signals, one from the previous gate in the
critical path and the other from the bit slice (white squares in
Fig. 2). The critical path model only considers variations on the input
signal from gates in the critical path. Variations on inputs received
from bit-slice logic are not factored into the delay model because the
bit-slice path is not critical, and its result will be computed well
before the carry-bit signal arrives. However, this assumption may be
idealistic since it is common for circuit designers to increase the
delay of non-critical paths in order to save power. Prior variation
models treat the gate delays as independent of one another. While this
greatly simplifies the analysis, a model intended for more thorough
comparative analysis should account for the correlation.

The improved critical path model does have its limitations. Due to the
characteristics of the delay model, the critical path model cannot
account for delay variations that occur in transistors in series. For
this reason, the critical path model cannot traverse paths that flow
through the NOR pull-up and NAND pull-down networks. Also, we neglect
the impact that variation in gate capacitance has on the previous
gate’s output signal. Finally, SRAM delay calculations do not take into
account bitline leakage.

3.2  Random  Process  Variations 

Random variations are modeled as being normally distributed parameters.
In this study we consider Cox, Leff, Weff, and Vth. Both SRAM and
combinational logic critical path models use a first-order RC model for
all elements. For such a model, the impact of varied parameters on gate
resistance can be expressed as:

$$R \propto \frac{L_{eff}}{\mu C_{ox} W_{eff}} \cdot \frac{1}{V_{dd} - V_{th}} \tag{3}$$

The model is evaluated with a brute-force Monte Carlo analysis on Ncp
critical paths to determine the unit’s delay distribution.

Transistor width becomes a very important parameter when comparing
different functional units. This is because the magnitude of random
dopant fluctuations in Vth depends on transistor width, Weff, with
narrower transistors exhibiting larger fluctuations.


For SRAM functional units, we assume the L1 cache to have minimal Weff.
For all other SRAM units, we assume Weff to be 5 times the minimal
value. This fact, together with the large size of caches, makes them
most likely to exhibit the worst variation.

Table  1  shows  our  baseline  assumptions  for  the  variability  of 
a  minimum  size  transistor.  These  values  were  extrapolated  from 
ITRS  and  academic  predictions  [15,  8]. 


Name    3σ/μ
Leff    12%
Vth     30%
Cox     10%
Weff    4%

Table 1: Default 3σ/μ for parameters varied.
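A minimal sketch of how the Table 1 spreads could drive a Monte Carlo estimate of gate resistance follows. It assumes independent normal parameters, holds mobility fixed, normalizes Leff, Cox, and Weff to 1.0, and uses 0.22 V as a hypothetical nominal Vth with a 1 V supply. The function names are illustrative and this is not the modified CACTI model.

```python
import random
import statistics

# 3*sigma/mu values from Table 1.
THREE_SIGMA = {"Leff": 0.12, "Vth": 0.30, "Cox": 0.10, "Weff": 0.04}

VDD = 1.0       # supply voltage in volts
VTH_NOM = 0.22  # hypothetical nominal threshold voltage in volts


def sample(nominal, three_sigma_frac, rng):
    """Draw a normally distributed parameter with the given 3*sigma/mu."""
    return rng.gauss(nominal, nominal * three_sigma_frac / 3.0)


def relative_resistance(rng):
    """R / R_nom, with R proportional to Leff / (Cox * Weff * (Vdd - Vth))."""
    leff = sample(1.0, THREE_SIGMA["Leff"], rng)
    cox = sample(1.0, THREE_SIGMA["Cox"], rng)
    weff = sample(1.0, THREE_SIGMA["Weff"], rng)
    vth = sample(VTH_NOM, THREE_SIGMA["Vth"], rng)
    r = leff / (cox * weff * (VDD - vth))
    r_nom = 1.0 / (VDD - VTH_NOM)
    return r / r_nom


def resistance_spread(samples=20000, seed=0):
    """Mean and standard deviation of the normalized gate resistance."""
    rng = random.Random(seed)
    values = [relative_resistance(rng) for _ in range(samples)]
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    mean, sigma = resistance_spread()
    print(f"R/R_nom: mean={mean:.3f} sigma={sigma:.3f}")
```

Summing such per-gate samples along a path, and taking the maximum over Ncp paths, would give the per-unit delay distribution described in the text.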

3.3  Systematic  Process  Variations 

Across-chip Leff variations arise from imperfections in the
fabrication process. The optical component that we model is chiefly
due to lens aberrations that can be modeled as a simple polynomial
function of position within the field of exposure [6]. The equation
can be approximated by

$$L_{eff} = a x^2 + b y^2 + c x + d y + e x y + \mathrm{intercept} \qquad (5)$$

where x and y are the coordinates on the chip’s surface. Baseline
values derived from [6] and scaled to 45nm are given in Table 2.
They were chosen under the assumption that the proportion
of WIDsys to mean Leff will stay constant with scaled dimensions.
In our model, systematic variations cause a 12% difference
between nominal Leff and the area of the chip having the
largest Leff.
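As an illustration of how Eq. 5 is evaluated over an exposure field, the sketch below samples the polynomial on a regular grid. The coefficients are hypothetical stand-ins chosen to give a symmetric bowl with roughly a 12% spread; they are not the Table 2 values, and the function name and grid resolution are likewise assumptions.

```python
import numpy as np


def leff_map(coeffs, size_mm=28.0, n=57):
    """Evaluate the Eq. 5 polynomial on an n-by-n grid over a square field."""
    a, b, c, d, e, intercept = coeffs
    axis = np.linspace(0.0, size_mm, n)
    x, y = np.meshgrid(axis, axis)
    return a * x**2 + b * y**2 + c * x + d * y + e * x * y + intercept


# HYPOTHETICAL coefficients (not Table 2): bowl with its minimum near the field centre.
HYPOTHETICAL_COEFFS = (0.0086, 0.0086, -0.24, -0.24, 0.0, 28.0)

grid = leff_map(HYPOTHETICAL_COEFFS)
spread = (grid.max() - grid.min()) / grid.max()
print(f"Leff range: {grid.min():.2f} nm to {grid.max():.2f} nm, spread {100 * spread:.1f}%")
```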


pre-RTL architectural treatment and the fact that within each core,
SRAMs dominate both the core’s operating frequency and its leakage,
exhibit a regular pattern density, and have near minimum-size
features. Also, Orshansky et al. [14] measure WIDsys for various
circuit layouts and show that the majority of circuit layouts will
have a similar bowl-like pattern across the chip.

Ultimately, variations in gate length matter because they affect
threshold voltage, which determines both switching speed and leakage.
In [7], the authors present an equation for determining Vth as
a function of Leff:

$$V_{th,eff} = V_{th0} - V_{dd} \cdot \exp(-\alpha_{DIBL} \cdot L_{eff}) \qquad (6)$$

where Vth0 is the threshold voltage for long-channel transistors,
0.22; αDIBL is the DIBL coefficient, 0.15; and Vdd is the supply
voltage, 1V. The default values for Vth0 and αDIBL were provided
in [5]. This equation highlights an important concept: as Leff
increases, Vth will also increase. This is why leakage has an
exponential dependence on Leff.
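A small numerical sketch of Eq. 6 makes the chain from Leff to leakage concrete. The defaults are the values quoted above; treating αDIBL as having units of 1/nm, and modeling sub-threshold leakage as proportional to exp(-Vth / (n·vT)) with n·vT = 0.039 V, are assumptions of this sketch rather than statements from the paper.

```python
import math

VTH0 = 0.22        # long-channel threshold voltage (V)
ALPHA_DIBL = 0.15  # DIBL coefficient; units of 1/nm are an ASSUMPTION
VDD = 1.0          # supply voltage (V)


def vth_eff(leff_nm, vth0=VTH0, alpha=ALPHA_DIBL, vdd=VDD):
    """Effective threshold voltage from Eq. 6."""
    return vth0 - vdd * math.exp(-alpha * leff_nm)


def relative_leakage(leff_nm, leff_ref_nm, n_vt=0.039):
    """Leakage of a device relative to a reference device.

    ASSUMPTION: I_sub is proportional to exp(-Vth / (n * vT)); n_vt is hypothetical.
    """
    return math.exp((vth_eff(leff_ref_nm) - vth_eff(leff_nm)) / n_vt)


short, nominal = 0.88 * 28.0, 28.0  # a 12% shorter channel versus a 28 nm reference
print(vth_eff(short), vth_eff(nominal), relative_leakage(short, nominal))
```

A shorter channel lowers Vth through the exponential DIBL term, and the lower Vth in turn raises leakage exponentially.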

4.  FREQUENCY  VARIATION 

In Fig. 3 the delay distribution of several of the more interesting
SRAM functional units is shown. The delay distribution only considers
the SRAM cell; the delay variation in the decode logic
is not taken into account. The figure illustrates an important
concept: not all units/stages will directly contribute to the final WID
frequency distribution. Table 3 shows the parameters corresponding
to each unit’s delay distribution. In Sec. 2, it was mentioned
that variance decreases as Ncp is increased; however, the 64KB L1
cache has a greater variance than the other two units even though
it has a much larger Ncp. The reason for this is that the L1 has
minimum-sized Weff, and according to Eq. 4, this causes the L1
to have greater σVth. As can be seen, either the L1 D-cache or
I-cache is likely to be the slowest SRAM unit because of the large
Ncp. The reason that variation in the SRAM cell only results in a
5% performance degradation is that SRAM access time is a combination
of bit-line delay, word-line delay, and sense-amp delay.
According to our modeling methodology, process variations will only
significantly impact bit-line delay. In conclusion, even though the
variation in bit-line delay can be relatively large, it will not have
a great impact on overall access time, since only a fraction of the
overall access time is susceptible to process variations.


Name   Entries   Line Size (bits)   Ports   Ncp      Weff (nm)
RF     120       64                 6       46080    375
TLB    1024      64                 1       65536    375
L1     512       1024               1       524288   75

Table 3: Description of functional units plotted in Fig. 3.


Parameter   Value
a           5.37 × 10^-4 nm/mm^2
b           1.829 × 10^-[?] nm/mm^2
c           -1.06 × 10^[?] nm/mm
d           -0.458 nm/mm
e           -1.67 × 10^-[?] nm/mm
intercept   28.0 nm

Table 2: Constants for our 2nd order polynomial equation for
modeling WID systematic variations. Exponents marked [?] are
illegible in the source.

Our model assumes that all circuit types within a core are
affected uniformly by WIDsys, neglecting the impact of pattern
density, orientation, and sizing. This is justified both by the high-level


In Figure 4, the delay distribution of the combinational logic
model is compared to the slowest SRAM stage (64KB L1 cache).
Mean combinational logic delay is significantly less than the L1
caches’ mean delay since Ncp is equal to 1 in the combinational
logic model. For this same reason, the logic delay distribution also
has a much greater variance, since variance decreases as Ncp
increases.
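The effect of Ncp on both the mean and the variance can be reproduced with a generic order-statistics experiment. The sketch below is not the paper's delay model: it simply assumes each critical path has an independent, normally distributed delay with a hypothetical 5% sigma and takes the slowest path per trial.

```python
import numpy as np


def max_delay_stats(n_cp, sigma=0.05, n_trials=1000, seed=0):
    """Mean and standard deviation of the slowest of n_cp i.i.d. normal path delays."""
    rng = np.random.default_rng(seed)
    paths = rng.normal(1.0, sigma, size=(n_trials, n_cp))
    worst = paths.max(axis=1)  # the unit is as slow as its slowest path
    return worst.mean(), worst.std()


results = {n_cp: max_delay_stats(n_cp) for n_cp in (1, 64, 4096)}
for n_cp, (mean, std) in results.items():
    print(f"Ncp={n_cp:5d}  mean={mean:.3f}  std={std:.4f}")
```

As Ncp grows, the mean of the worst-case delay rises while its spread narrows, which matches the qualitative relationship between the logic stage (Ncp = 1) and the L1 cache described above.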

Figure 3: Cacti-generated delay distribution for several different
SRAM functional units. (x-axis: % frequency slowdown due to random
process variations.)

The simplistic critical path model shows that the WID frequency
distribution of the processor core will solely be determined by the
L1 caches. The primary reason for this is that the L1 caches have a
greater Ncp than all other SRAM structures, which increases the mean
of the delay distribution. If caches are removed from consideration
(e.g., by allowing multi-cycle access), then TLBs and other
SRAM structures dominate. With nominal WIDsys, combinational
logic stages will be much faster than SRAM because of their low
Ncp. However, when D2D variations are considered, it is possible
for combinational logic to have delay greater than the L1, since logic
stages are more sensitive to changes in Leff. Unit-to-unit delay
variations will contribute to clock skew, which does have an impact
on the maximum frequency. Determining the impact of unit-to-unit
variations on clock skew is more of a circuit-design issue than
a pre-RTL architectural-modeling issue, and hence is beyond the
scope of this paper.


Figure 4: Comparison of logic stage’s delay distribution to that
of the slowest SRAM stage. Logic stage is modeled as having
Ncp of 1 and Lcp of 14. (x-axis: % frequency slowdown due to random
process variations.)

Systematic Leff variations will have a detrimental impact on delay,
since resistance is dependent on both Leff and Vth. The magnitude
of the effect depends upon the functional unit’s ratio of logic
delay to wire delay, the change in Leff, and how problematic the
DIBL effect is in the particular process. Intuitively, logic-dominated
stages will be more impacted by systematic Leff variations
than an SRAM stage, since logic stages are more transistor-dominated
than an SRAM. This is illustrated in Fig. 5. Data in this figure
was gathered by a Monte-Carlo analysis with the mean Leff value
being varied from 0% to 12% in order to represent WIDsys. As the
figure shows, WIDsys will more severely affect the performance of
logic stages than SRAMs. Fig. 5 also shows that WIDsys will result
in C2C frequency variation, with the mean difference between the
fastest core location and the slowest core location being less than
5%.

Figure 5: Distributions of a logic stage and a 64KB cache when
random variations and 12% systematic variations are considered.

In summary, when considering only WIDrand variations, the L1
cache will determine the processor’s FMAX distribution, resulting
in an average 5% performance loss. The degradation is likely to
affect all cores on a chip to a similar degree, because the variance
in this value will be small (< 1%). Since delay exhibits a linear
dependency on systematic variations, a 12% degradation in Leff
will result in an average frequency degradation of roughly 5%. The
combination of WIDrand and WIDsys results in a 10% frequency
degradation for the slowest location on the die. A 12% change in
Leff is a worst-case assumption, so according to our model, the
difference between the frequency of the fastest core location and
the slowest core location will be less than 5%. It is worth noting
that the C2C variation would likely be even smaller if the relationship
between leakage and SRAM access time were considered in our
delay model, because higher leakage slows down SRAM access time,
and the fastest cores will have the most leakage.

5.  LEAKAGE  VARIATION 

In the previous section we showed why WID variations will not
play a large role in determining the C2C frequency distribution.
When turning to leakage, this is not the case. As mentioned
previously, WIDrand leakage variation will not be significant at a
granularity coarse enough to concern microarchitects, since variation
is averaged out when a sum operation is performed. Fig. 6 illustrates
this. The distributions of the leakage summation across 1, 2, and 4
transistors are shown. Each distribution is normalized to its smallest
value in order to compare the variance of each distribution.
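The averaging effect can be checked numerically. The sketch below assumes per-transistor leakage is log-normally distributed (as would follow from leakage depending exponentially on a normally distributed Vth) with a hypothetical log-sigma of 0.5, and reports the coefficient of variation of the summed leakage. It uses a different normalization from Fig. 6 and is meant only to show the trend.

```python
import numpy as np


def leakage_cv(n_transistors, sigma_log=0.5, n_trials=20000, seed=0):
    """Coefficient of variation of leakage summed over n independent transistors."""
    rng = np.random.default_rng(seed)
    # ASSUMPTION: per-transistor leakage is log-normal; sigma_log is hypothetical.
    leak = rng.lognormal(mean=0.0, sigma=sigma_log, size=(n_trials, n_transistors))
    total = leak.sum(axis=1)
    return total.std() / total.mean()


cvs = {n: leakage_cv(n) for n in (1, 2, 4)}
for n, cv in cvs.items():
    print(f"{n} transistor(s): sigma/mean = {cv:.3f}")
```

For independent devices the relative spread of the sum falls roughly as 1/sqrt(N), so over the very large number of transistors in a unit or core the random component becomes negligible.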

Figure 6: Normalized aggregate leakage across 1, 2, and 4 transistors.
Each distribution is normalized to the smallest value in
the distribution so that variances can be compared.

On the other hand, WIDsys Leff variations will shift the threshold
voltage of all transistors in a core by an offset. Vth has an
exponential effect on the overall leakage of a core, as opposed to
the linear effect of threshold variations on frequency. The magnitude
of the across-chip leakage variation is dependent on both Leff
and the DIBL coefficient. Fig. 9 illustrates the relationship between
these parameters and leakage. In all leakage calculations, the
feedback loop between temperature and leakage has been closed.
Thermal calculations were performed using HotSpot [16]. By modeling
the thermal-leakage feedback, more accurate leakage estimations
can be obtained, since leakier cores will have a higher power
density, and therefore higher temperatures, than cores with less
leakage. Since leakage is exponentially dependent on temperature, the
feedback loop will exacerbate C2C leakage variation. It is also
important to recognize that, because of the characteristics of the
polynomial equation used to model systematic variation, the worst-case
leakage value is independent of the DIBL coefficient: worst-case
leakage occurs in the area of the die having nominal Leff. Changing
the DIBL coefficient results in more C2C leakage variation because
this parameter determines the leakage of the core that is located in
the area of the die having the largest Leff value. The leakage in
this area of the die will increase with αDIBL. Simply put, parameter
values that are good for performance (small Leff and Vth
values) are worse for leakage.
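To see why closing the thermal-leakage loop widens the gap between cores, consider the following lumped sketch. It is a stand-in for a full HotSpot simulation, not a use of it: a single thermal resistance, an exponential leakage-temperature law, and every numeric value below are hypothetical assumptions chosen only for illustration.

```python
import math


def settle(p_dyn_w, p_leak_ref_w, t_amb_c=45.0, r_th_c_per_w=0.8,
           k_per_c=0.017, tol=1e-9, max_iter=500):
    """Fixed-point solution of T = T_amb + R_th * (P_dyn + P_leak(T)).

    ASSUMPTION: P_leak(T) = P_leak_ref * exp(k * (T - T_amb)), one lumped R_th.
    Returns (steady-state temperature in C, steady-state leakage in W).
    """
    def leakage(temp_c):
        return p_leak_ref_w * math.exp(k_per_c * (temp_c - t_amb_c))

    temp = t_amb_c
    for _ in range(max_iter):
        new_temp = t_amb_c + r_th_c_per_w * (p_dyn_w + leakage(temp))
        if abs(new_temp - temp) < tol:
            return new_temp, leakage(new_temp)
        temp = new_temp
    raise RuntimeError("no convergence (possible thermal runaway)")


# Two cores with equal dynamic power; one leaks 1.5x more at the reference temperature.
_, leak_low = settle(p_dyn_w=10.0, p_leak_ref_w=2.0)
_, leak_high = settle(p_dyn_w=10.0, p_leak_ref_w=3.0)
print(f"ratio before feedback: 1.50, after feedback: {leak_high / leak_low:.3f}")
```

The leakier core settles at a higher temperature, which raises its leakage further, so the steady-state ratio exceeds the ratio at the reference temperature.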

The situation that we analyze is one in which the entire exposure
field is 28mm by 28mm, with the reticle comprising 4
identical 14mm by 14mm dies. The WIDsys Leff pattern is transposed
onto a grid, and the resulting across-chip systematic Leff
pattern is depicted in Fig. 8. This was derived using Eq. 5 and
the baseline constants in Table 2. We consider a POWER4-like
core scaled to 45nm dimensions. Assuming constant scaling, the
core area will be 2.5mm by 2.25mm. In order to gather the leakage
distribution, all possible core positions on the chip’s surface
are considered. Sub-threshold leakage is determined by taking the
aggregate sum of the leakage in the core’s underlying grid cells.
The C2C leakage distribution for all possible core positions on a
die is shown in Fig. 7. Because of the polynomial nature of the
systematic variation equation, the resulting distribution is
negatively skewed. As mentioned earlier, closing the thermal
feedback loop exacerbates C2C leakage variation.
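The core-placement sweep described above can be sketched as a sliding-window sum over a grid of per-cell leakage. The sketch is a simplified illustration under stated assumptions: the Leff bowl is hypothetical (not computed from Table 2), per-cell leakage uses the assumed form exp(-Vth / (n·vT)) with n·vT = 0.039 V, the grid pitch of 0.25 mm is arbitrary, and thermal feedback is omitted.

```python
import numpy as np


def core_leakage_map(leff_grid, core_rows, core_cols,
                     vth0=0.22, alpha=0.15, vdd=1.0, n_vt=0.039):
    """Relative leakage of a core for every possible placement on the grid."""
    vth = vth0 - vdd * np.exp(-alpha * leff_grid)  # Eq. 6 applied per grid cell
    cell_leak = np.exp(-vth / n_vt)                # ASSUMED sub-threshold model
    # 2D prefix sums let each window sum be read off in constant time.
    prefix = np.zeros((leff_grid.shape[0] + 1, leff_grid.shape[1] + 1))
    prefix[1:, 1:] = cell_leak.cumsum(axis=0).cumsum(axis=1)
    return (prefix[core_rows:, core_cols:] - prefix[:-core_rows, core_cols:]
            - prefix[core_rows:, :-core_cols] + prefix[:-core_rows, :-core_cols])


# HYPOTHETICAL Leff bowl on a 14 mm x 14 mm die sampled at 0.25 mm pitch.
centres = (np.arange(56) + 0.5) * 0.25
x, y = np.meshgrid(centres, centres)
leff = 24.64 + 3.36 * ((x - 7.0) ** 2 + (y - 7.0) ** 2) / 98.0

# 2.25 mm x 2.5 mm core -> 9 x 10 grid cells.
leak = core_leakage_map(leff, core_rows=9, core_cols=10)
print(f"placements: {leak.size}, max/min leakage ratio: {leak.max() / leak.min():.2f}")
```

Each entry of the result corresponds to one candidate core position, so a histogram of the array gives a C2C leakage distribution analogous in spirit to Fig. 7.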

Figure 7: Leakage distribution of all possible core positions on
a die’s surface for different αDIBL values.

In contrast, random variations in leakage will not be of particular
interest in the architectural domain. The reason for this is that the
leakage in a core or unit is an aggregate sum of the leakage in
the underlying transistors. When a summation is taken across a
large enough sample, very little variation in the mean will occur
because of the “averaging effect” that occurs when a sum operation
is performed on random variables.

6.  CONCLUSIONS  AND  FUTURE  WORK 

This  paper  presents  a  model  that  allows  microarchitects  to  reason 
about  how  WID  process  variations  may  affect  a  multi-core  environ¬ 
ment.  The  model  is  based  on  an  abstract  representation  of  combi¬ 
national  logic  and  SRAM  structures,  and  accounts  for  the  way  logic 
depth  (Lcp)  and  the  number  of  independent  critical  paths  (Ncp)  af¬ 
fect  delay  distribution.  Using  the  model,  this  paper  shows  that: 

•  Unit-to-unit  variations  within  a  single  core  are  likely  to  be 
dominated  by  SRAM  structures. 

•  WID random variations will not materially affect the C2C
distributions; each core is likely to be impacted by random
variations to a similar degree.

•  The  impact  of  WID  systematic  variations  on  the  C2C  fre¬ 
quency  distribution  will  be  minimal. 

•  The exponential dependence of leakage on Vth, and of Vth
on Leff, means that WIDsys variations will produce C2C
leakage variations up to 45%.

Figure 8: 2D contour map of across-chip Leff variation in nm.

These results suggest that pre-RTL PVT modeling is important for
future multi-core designs. The goal of this work is not to dismiss
the importance of random variations within individual cores, but
rather to argue that the impact of random variations chiefly
manifests at a circuit level of abstraction, where optimizing the length
and number of critical paths will be most fruitful. The impact of
systematic variations, on the other hand, chiefly manifests at an
architectural level. Our results suggest that architectural techniques
for addressing WID process variations should therefore focus on
variation-tolerant integration of cores, rather than
within-core techniques for balancing out variation among units.
This might involve novel leakage- and temperature-aware scheduling
techniques, the capability for multiple cores to operate at
independent voltage and frequency, variation-aware per-core
leakage-mitigation techniques, and so forth. Note that techniques like
Razor [9] may still be needed to reclaim excess design margins.
Razor-like techniques also provide the opportunity to design for
typical-case variations, relying on error-recovery support like Razor
shadow latches for exceptional runtime conditions and unusual input
values. Our conclusions are predicated on an “FMAX” assumption,
namely that the clock speed is determined by the worst-case delay
through any critical stage (or nearly so, e.g. 3σ). The opportunities
for within-core variation mitigation are larger if the clock speed is
in fact determined by average- or common-case delay, which will
cause many paths to violate timing integrity (sometimes
intermittently, for paths where temperature and voltage are the
determining factor). In addition to Razor, a variety of other
fault-tolerance techniques may be helpful.

Improving the model’s fidelity is an obvious direction for future
work. Exploring the relationship between WIDsys correlation
distance and core size is an especially important aspect. A sensitivity
study on the impact of different magnitudes of the random and
systematic variation phenomena is also needed. Extending the model
to account for D2D and W2W variations may be valuable too, as
architectural techniques may be able to mitigate these effects.